To optimize the dataframe's memory usage, we downcast the numeric columns to the smallest subtype that can hold them:
df_int = df.select_dtypes(include=['int64'])
converted_int = df_int.apply(pd.to_numeric, downcast='unsigned')
df_float = df.select_dtypes(include=['float64'])
converted_float = df_float.apply(pd.to_numeric, downcast='float')
When we convert object columns to category, it's important to be aware of the trade-off: a column with mostly unique values can end up using more memory as a category, so we only convert columns where fewer than 50% of the values are unique:
gl_obj = gl.select_dtypes(include=['object']).copy()
converted_obj = pd.DataFrame()
for col in gl_obj.columns:
    num_unique_values = len(gl_obj[col].unique())
    num_total_values = len(gl_obj[col])
    if num_unique_values / num_total_values < 0.5:
        converted_obj.loc[:, col] = gl_obj[col].astype('category')
    else:
        converted_obj.loc[:, col] = gl_obj[col]
Dates are parsed into the datetime64 type:
df['date'] = pd.to_datetime(df['date'], format='%Y%m%d')
Finally, we capture the optimized dtypes and pass them to pandas.read_csv() so the file is loaded with the correct types from the start:
dtypes = optimized_gl.drop('date', axis=1).dtypes
dtypes_col = dtypes.index
dtypes_type = [i.name for i in dtypes.values]
column_types = dict(zip(dtypes_col, dtypes_type))
read_and_optimized = pd.read_csv('../data/game_logs.csv', dtype=column_types, parse_dates=['date'], infer_datetime_format=True)
In [9]:
import os
import pandas as pd
In [10]:
# Load Data
gl = pd.read_csv('../data/game_logs.csv')
# Also available at https://data.world/dataquest/mlb-game-logs
In [11]:
# Data Preview
gl.head()
Out[11]:
Original Article https://www.dataquest.io/blog/pandas-big-data/
In [14]:
gl.dtypes.head()
Out[14]:
In [37]:
# Select only the columns of a given type
gl.select_dtypes(include=['object']).head()
Out[37]:
In [16]:
# Exact amount of memory used by the dataframe
gl.info(memory_usage='deep')
In [18]:
gl.describe()
Out[18]:
In [24]:
# Reference http://www.markhneedham.com/blog/2017/07/05/pandas-find-rows-where-columnfield-is-null/
# Columns with null values
null_columns=gl.columns[gl.isnull().any()]
gl[null_columns].isnull().sum()
Out[24]:
In [26]:
# Every row that contains at least one null value
print(gl[gl.isnull().any(axis=1)][null_columns].head())
Under the hood, pandas groups the columns into blocks of values of the same type. Because each data type is stored separately, let's examine how many columns we have of each type and the average memory usage per data type:
In [62]:
gl.dtypes.value_counts()
Out[62]:
In [60]:
for dtype in gl.dtypes.unique():
    selected_dtype = gl.select_dtypes(include=[dtype])
    mean_usage_b = selected_dtype.memory_usage(deep=True).mean()
    mean_usage_mb = mean_usage_b / 1024 ** 2
    print("Average memory usage for {} columns: {:03.2f} MB".format(dtype, mean_usage_mb))
Under the hood, pandas represents numeric values as NumPy ndarrays and stores them in a contiguous block of memory. This approach consumes less space and allows us to access the values quickly.
Many types in pandas have multiple subtypes that can use fewer bytes to represent each value. For example, the float type has the float16, float32, and float64 subtypes.
To discover the range of values a given dtype can represent, we can use numpy.iinfo: https://docs.scipy.org/doc/numpy-1.10.0/reference/generated/numpy.iinfo.html
In [63]:
import numpy as np
int_types = ["uint8", "int8", "int16"]
for it in int_types:
    print(np.iinfo(it))
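The float subtypes mentioned above can be inspected in the same way with numpy.finfo; a small sketch:
# Range and precision of each float subtype
float_types = ["float16", "float32", "float64"]
for ft in float_types:
    print(np.finfo(ft))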
In [64]:
# We're going to be calculating memory usage a lot,
# so we'll create a function to save us some time!
def mem_usage(pandas_obj):
    if isinstance(pandas_obj, pd.DataFrame):
        usage_b = pandas_obj.memory_usage(deep=True).sum()
    else:  # we assume that if it's not a dataframe, it's a series
        usage_b = pandas_obj.memory_usage(deep=True)
    usage_mb = usage_b / 1024 ** 2  # convert bytes to megabytes
    return "{:03.2f} MB".format(usage_mb)
In [65]:
mem_usage(gl)
Out[65]:
In [68]:
gl_int = gl.select_dtypes(include=['int64'])
converted_int = gl_int.apply(pd.to_numeric,downcast='unsigned')
In [69]:
print(mem_usage(gl_int))
print(mem_usage(converted_int))
In [70]:
compare_ints = pd.concat([gl_int.dtypes,converted_int.dtypes],axis=1)
compare_ints.columns = ['before','after']
compare_ints.apply(pd.Series.value_counts)
Out[70]:
We can see a drop from 7.9 MB to 1.5 MB. We now have five uint8 columns and one uint32 column instead of six int64 columns.
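To make the downcasting behaviour concrete, here is a minimal sketch on two hypothetical toy Series; downcast picks the smallest subtype whose range still covers every value in the column:
s_small = pd.Series([0, 120, 250], dtype='int64')
print(pd.to_numeric(s_small, downcast='unsigned').dtype)  # uint8, since 250 fits in 0..255
s_large = pd.Series([0, 120, 70000], dtype='int64')
print(pd.to_numeric(s_large, downcast='unsigned').dtype)  # uint32, since 70000 exceeds the uint16 maximum of 65535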
Let's do the same thing with the float columns.
In [72]:
gl_float = gl.select_dtypes(include=['float'])
converted_float = gl_float.apply(pd.to_numeric,downcast='float')
print(mem_usage(gl_float))
print(mem_usage(converted_float))
compare_floats = pd.concat([gl_float.dtypes,converted_float.dtypes],axis=1)
compare_floats.columns = ['before','after']
compare_floats.apply(pd.Series.value_counts)
Out[72]:
All our float columns were converted from float64 to float32, giving us a 50% reduction in memory usage.
Now we apply these optimizations to the entire dataframe.
In [73]:
optimized_gl = gl.copy()
optimized_gl[converted_int.columns] = converted_int
optimized_gl[converted_float.columns] = converted_float
print(mem_usage(gl))
print(mem_usage(optimized_gl))
To get the biggest savings, however, we have to optimize the object columns.
The object type represents values using Python string objects, partly due to the lack of support for missing string values in NumPy. Because Python is a high-level, interpreted language, it doesn't give us fine-grained control over how values are stored in memory.
This limitation causes strings to be stored in a fragmented way that consumes more memory and is slower to access. Each element in an object column is really a pointer that contains the “address” of the actual value's location in memory.
Object columns therefore use a variable amount of memory. While each pointer takes up 8 bytes of memory (on a 64-bit system), each actual string value uses the same amount of memory that string would use if stored individually in Python. Let's use sys.getsizeof() to prove that out, first by looking at individual strings, and then at items in a pandas series.
In [74]:
from sys import getsizeof
s1 = 'working out'
s2 = 'memory usage for'
s3 = 'strings in python is fun!'
s4 = 'strings in python is fun!'
for s in [s1, s2, s3, s4]:
    print(getsizeof(s))
In [76]:
obj_series = pd.Series(['working out',
'memory usage for',
'strings in python is fun!',
'strings in python is fun!'])
obj_series.apply(getsizeof)
Out[76]:
You can see that the size of strings when stored in a pandas series is identical to their usage as separate strings in Python.
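Another way to see the pointer-versus-string split described above is to compare shallow and deep memory accounting on the same series (a minimal sketch):
# deep=False counts only the fixed-size object pointers (plus the index);
# deep=True also adds the size of each underlying Python string object.
print(obj_series.memory_usage(deep=False))
print(obj_series.memory_usage(deep=True))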
The category type uses integer values under the hood to represent the values in a column, rather than the raw values. Pandas uses a separate mapping dictionary that maps the integer values to the raw ones. This arrangement is useful whenever a column contains a limited set of values. When we convert a column to the category dtype, pandas uses the most space efficient int subtype that can represent all of the unique values in a column.
In [78]:
# Where might we be able to reduce memory usage?
gl_obj = gl.select_dtypes(include=['object']).copy()
gl_obj.describe()
Out[78]:
A quick glance reveals many columns where there are few unique values relative to the overall ~172,000 games in our data set. We start by converting day_of_week to category using the .astype() method.
In [79]:
dow = gl_obj.day_of_week
print(dow.head())
dow_cat = dow.astype('category')
print(dow_cat.head())
In [80]:
# We can see the integer codes associated with the column
dow_cat.head().cat.codes
Out[80]:
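The mapping between those integer codes and the original labels is exposed through the .cat.categories accessor; as a small sketch, we can reconstruct it as a dictionary:
# Code -> label mapping used internally by the category column
dict(enumerate(dow_cat.cat.categories))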
In [81]:
# We compare the memory usage
print(mem_usage(dow))
print(mem_usage(dow_cat))
When we convert columns to category, it's important to be aware of the trade-off: if most of a column's values are unique, a category column can actually use more memory than the original object column, since it still has to store every distinct string. We therefore only convert columns where fewer than 50% of the values are unique:
In [82]:
converted_obj = pd.DataFrame()
for col in gl_obj.columns:
    num_unique_values = len(gl_obj[col].unique())
    num_total_values = len(gl_obj[col])
    if num_unique_values / num_total_values < 0.5:
        converted_obj.loc[:, col] = gl_obj[col].astype('category')
    else:
        converted_obj.loc[:, col] = gl_obj[col]
In [83]:
print(mem_usage(gl_obj))
print(mem_usage(converted_obj))
compare_obj = pd.concat([gl_obj.dtypes,converted_obj.dtypes],axis=1)
compare_obj.columns = ['before','after']
compare_obj.apply(pd.Series.value_counts)
Out[83]:
In [84]:
# Now we combine with the rest of our dataframe (numeric columns)
optimized_gl[converted_obj.columns] = converted_obj
mem_usage(optimized_gl)
Out[84]:
In [85]:
date = optimized_gl.date
print(mem_usage(date))
date.head()
Out[85]:
We'll convert using the pandas.to_datetime() function, using the format parameter to tell it that our dates are stored as YYYYMMDD.
In [86]:
optimized_gl['date'] = pd.to_datetime(date,format='%Y%m%d')
print(mem_usage(optimized_gl))
optimized_gl.date.head()
Out[86]:
We can specify the optimal column types when we read the data set in. The pandas.read_csv() function has a few different parameters that allow us to do this. The dtype parameter accepts a dictionary that has (string) column names as the keys and NumPy type objects as the values.
In [93]:
dtypes = optimized_gl.drop('date',axis=1).dtypes
dtypes.head()
Out[93]:
In [94]:
dtypes_col = dtypes.index
dtypes_col
Out[94]:
In [103]:
dtypes_type=[i.name for i in dtypes.values]
In [104]:
column_types = dict(zip(dtypes_col, dtypes_type))
In [117]:
#Preview of first 10
{k:v for k,v in list(column_types.items())[:10]}
Out[117]:
Now we can use the dictionary, along with a few parameters for parsing the date column, to read in the data with the correct types in a few lines:
In [119]:
read_and_optimized = pd.read_csv('../data/game_logs.csv', dtype=column_types, parse_dates=['date'], infer_datetime_format=True)
print(mem_usage(read_and_optimized))
read_and_optimized.head()
Out[119]:
In [136]:
import matplotlib.pyplot as plt
# Average game length (in minutes) per year
optimized_gl['year'] = optimized_gl.date.dt.year
game_lengths = optimized_gl.pivot_table(index='year', values='length_minutes')
game_lengths.reset_index().plot.scatter('year', 'length_minutes')
plt.show()